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Abstract 

Suppose we can sequentially acquire arbitrary linear measurements of an n-dimensional vector x 
resulting in the linear model y = Ax + z, where z represents measurement noise. If the signal is known 
to be sparse, one would expect the following folk theorem to be true: choosing an adaptive strategy which 
cleverly selects the next row of A based on what has been previously observed should do far better than 
a nonadaptive strategy which sets the rows of A ahead of time, thus not trying to learn anything about 
the signal in between observations. This paper shows that the folk theorem is false. We prove that 
the advantages offered by clever adaptive strategies and sophisticated estimation procedures — no matter 
how intractable — over classical compressed acquisition/recovery schemes are, in general, minimal. 

Keywords: sparse signal estimation, adaptive sensing, compressed sensing, support recovery, infor- 
mation bounds, hypothesis tests. 

1 Introduction 

This paper is concerned with the fundamental question of how well one can estimate a sparse vector 
from noisy linear measurements in the general situation where one has the flexibility to design those 
measurements at will (in the language of statistics, one would say that there is nearly complete freedom 
in designing the experiment). This question is of importance in a variety of sparse signal estimation or 
sparse regression scenarios, but perhaps arises most naturally in the context of compressive sensing (CS) 
[4, 5, 11]. In a nutshell, CS asserts that it is possible to reliably acquire sparse signals from just a few linear 
measurements selected a priori. More specifically, suppose we wish to acquire a sparse signal x G M". A 
possible CS acquisition protocol would proceed as follows. (?) Pick an m x n random projection matrix A 
(the first m rows of a random unitary matrix) in advance, and collect data of the form 

y = Ax + z, (1.1) 

where z is a vector of errors modeling the fact that any real world measurement is subject to at least a 
small amount of noise, (ii) Recover the signal by solving an ii minimization problem such as the Dantzig 
selector [6] or the LASSO [28]. As is now well known, theoretical results guarantee that such convex 
programs yield accurate solutions. In particular, when z = 0, the recovery is exact, and the error degrades 
gracefully as the noise level increases. 

A remarkable feature of the CS acquisition protocol is that the sensing is completely nonadaptive; that 
is to say, no effort whatsoever is made to understand the signal. One simply selects a collection {aj} 
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of sensing vectors a priori (the rows of the matrix A), and measures correlations between the signal and 
these vectors. One then uses numerical optimization — e.g., linear programming [6] — to tease out the sparse 
signal X from the data vector y. While this may make sense when there is no noise, this protocol might 
draw some severe skepticism in a noisy environment. To see why, note that in the scenario above, most of 
the power is actually spent measuring the signal at locations where there is no information content, i.e., 
where the signal vanishes. Specifically, let a be a row of the matrix A which, in the scheme discussed 
above, has uniform distribution on the unit sphere. The dot product is 



and since most of the coordinates xj are zero, one might think that most of the power is wasted. Another 
way to express all of this is that by design, the sensing vectors are approximately orthogonal to the signal, 
yielding measurements with low signal power or a poor signal-to- noise ratio (SNR). 

The idea behind adaptive sensing is that one should localize the sensing vectors around locations where 
the signal is nonzero in order to increase the SNR, or equivalently, not waste sensing power. In other 
words, one should try to "learn" as much as possible about the signal while acquiring it in order to design 
more effective subsequent measurements. Roughly speaking, one would (i) detect those entries which are 
nonzero or significant, (ii) progressively localize the sensing vectors on those entries, and (iii) estimate the 
signal from such localized linear functionals. This is akin to the game of 20 questions in which the search is 
narrowed by formulating the next question in a way that depends upon the answers to the previous ones. 
Note that in some applications, such as in the acquisition of wideband radio frequency signals, aggressive 
adaptive sensing mechanisms may not be practical because they would require near instantaneous feedback. 
However, there do exist applications where adaptive sensing is practical and where the potential benefits 
of adaptivity are too tantalizing to ignore. 

The formidable possibilities offered by adaptive sensing give rise to the following natural "folk theorem." 

Folk Theorem. The estimation error one can get by using a clever adaptive sensing scheme 
is far better than what is achievable by a nonadaptive scheme. 

In other words, learning about the signal along the way and adapting the questions (the next sensing 
vectors) to what has been learned to date is bound to help. In stark contrast, the main result of this paper 
is this: 

Surprise. The folk theorem is wrong in general. No matter how clever the adaptive sensing 
mechanism, no matter how intractable the estimation procedure, in general it is not possible to 
achieve a fundamentally better mean-squared error (MSE) of estimation than that offered by a 
naive random projection followed by ii minimization. 

The rest of this article is mostly devoted to making this claim precise. In doing so, we shall also show 
that adaptivity does not help in obtaining a fundamentally better estimate of the signal support, which is 
of independent interest. 

1.1 Main result 

To formalize matters, we assume that the error vector z in (1.1) has i.i.d. A/'(0, o"^) entries. Then if A is 
a random projection with unit-norm rows as discussed above, [6] shows that the Dantzig selector estimate 
x^^ (obtained by solving a simple linear program) achieves an MSE obeying 



n 
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x||i<C-log(n)a2, 



(1.2) 
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where C is some numerical constant. The bound holds universally over all A;-sparse signals^ provided that 
the number of measurements m is sufficiently large (on the order of at least k\og{n/k)). Moreover, one can 
show that this result is essentially optimal in the sense that any possible nonadaptive choice of A (with 
unit-norm rows) and any possible estimation procedure x will satisfy 

-EP-x||i > C7'— log(n//fc)a^ (1.3) 
n m 

where C is a numerical constant [3]. The fundamental question is thus: how much lower can the MSE he 
when (i) we are allowed to sense the signal adaptively and {ii) we can use any estimation algorithm we 
like to recover x. 

The distinction between adaptive and nonadaptive sensing can be expressed in the following manner. 
Begin by rewriting the statistical model (1.1) as 

yj = (ai,x) + Zj, z = l,...,m, (1.4) 

in which a power constraint imposes that each a, is of norm at most 1, i.e., ||aj||2 < 1; then in a nonadaptive 
sensing scheme the vectors ai , . . . , a^ are chosen in advance and do not depend on x or z whereas in an 
adaptive setting, the measurement vectors may be chosen depending on the history of the sensing process, 
i.e., aj is a (possibly random) function of (ai,yi, . . . ,aj_i,yj_i). 

If we follow the principle that "you cannot get something for nothing," one might argue that giving up 
the freedom to adaptively select the sensing vectors would result in a far worse MSE. Our main contribution 
is to show that this is not the case. 

Theorem 1. Suppose that k < n/2 and let m be an arbitrary number of measurements. Assume that x is 
sampled with i.i.d. coordinates such that xj = with probability 1 — k/n and xj = // with probability k/n 
(so that we have k nonzero entries on the average). Then for /i = Ia/^, any sensing strategy and any 
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estimate x obey 

^E||x-xi> 1A^2. (1.5) 

n 27 m i m 



For any n and k, the number of nonzero entries in a random vector drawn from the Bernoulli prior is 
between k =b 3Vk with probability at least 99%. With some additional arguments, and when n and k are 
sufficiently large, we can actually show that the last inequality in (1.5) holds true in a minimax sense when 
X is known to have a support size in that range. In order to avoid unnecessary technicalities, we prove a 
simpler result. 

Theorem 2. For any n >2 and k < n/2, and any m, 

1 k 
inf sup — E ||x — x||2 > Cfc — o"^, (1-6) 

X ||x||o<fc ^ ^ 

in which inffc>i Cfc > 1/33. For k > Id we can take > 1/15, and when k is sufficiently large we can 
take Ck = 1/7. 



In short. Theorems 1 and 2 say that if one ignores a logarithmic factor, then adaptive measurement 
schemes cannot (substantially) outperform nonadaptive strategies. While seemingly counterintuitive, we 
find that precisely the same sparse vectors which determine the minimax rate in the nonadaptive setting 
are essentially so difficult to estimate that by the time we have identified the support, we will have already 
exhausted our measurement budget (i.e., we will have acquired all m measurements). 

'^A signal is said to be k-sparse if it has at most k nonzero components. We also occasionally use the notation ||x||o to 
denote the number of nonzero components of x. 
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Before moving on, we should clarify precisely what we mean by a substantial improvement. After all, 
the lower bound in Theorem 2 does improve upon the nonadaptive bound in (1.3) by a factor of log(n/A:). 
Indeed, we will see in Section 4 that at least in some very special cases (e.g., when A: = 1), this log factor 
can in fact be eliminated. However, this is a relatively modest improvement compared to what one might 
hope to gain by exploiting adaptivity. Specifically, consider a simple adaptive procedure that uses m/2 
measurements to identify the support of x and uses the remaining m/2 measurements to estimate the 
values of the nonzeros. If such a scheme identifies the correct support, then it is easy to show that this 
procedure will yield an estimate satisfying 

1 TT-l 11^ Il2 ^ 2 

-Ex- X 2 = 

n n m 

Thus, there seems to be room for reducing the error by a factor of k/n beyond the log(n//c) factor. 
Theorem 2, however, shows that this gain is not possible in general. 

On the one hand, our main result states that one cannot universally improve on bounds achievable via 
nonadaptive sensing strategies. Indeed, we will see that there are natural classes of sparse signals for which, 
even after applying the most clever sensing scheme and the most subtle testing procedure, one would still 
not be sure about where the nonzeros lie. This remains true even after having used up the entirety of our 
measurement budget. On the other hand, our result does not say that adaptive sensing never helps. In 
fact, there are many instances in which it will. For example, when some or most of the nonzero entries in 
X are sufficiently large, they may be detected sufficiently early so that one can ultimately get a far better 
MSE than what would be obtained via a nonadaptive scheme, see Section 3 for simple experiments in this 
direction and Section 4 for further discussion. 

1.2 Connections with testing problems 

The arguments we develop to reach our conclusions are quite intuitive, simple, and yet they seem different 
from the classical Fano-type arguments for obtaining information-theoretic lower bounds (see Section 1.3 for 
a discussion of the latter methods). Our approach involves proving a lower bound for the Bayes risk under 
the prior from Theorem 1. To obtain such a lower bound, we make a detour through testing — multiple 
testing to be exact. Our argument proceeds through two main steps: 

• Support recovery in Hamming distance. We consider the multiple testing problem of deciding which 
components of the signal are zero and which are not. We show that no matter which adaptive strategy 
and tests are used, the Hamming distance between the estimated and true supports is large. Put 
differently, the multiple testing problem is shown to be difficult. In passing, this establishes that 
adaptive schemes are not substantially better than nonadaptive schemes for support recovery. 

• Estimation with mean-squared loss. Any estimator with a low MSE can be converted into an effective 
support estimator simply by selecting the largest coordinates or those above a certain threshold. 
Hence, a lower bound on the Hamming distance immediately gives a lower bound on the MSE. 

The crux of our argument is thus to show that it is not possible to choose sensing vectors adaptively in 
such a way that the support of the signal may be estimated accurately. 

1.3 Differential entropies and Fano-type arguments 

Our approach is significantly different from classical methods for getting lower bounds in decision and 
information theory. Such methods typically rely on Fano's inequality [9], and are all intimately related to 
methods in statistical decision theory (see [29, 31]). Before continuing, we would like to point out that 
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Fano-type arguments have been used successfully to obtain (often sharp) lower bounds for some adaptive 
methods. For example, the work [8] uses results from [29] to establish a bound on the minimax rate for 
binary classification (see the references therein for additional literature on active learning). Other examples 
include the recent paper [26], which derives lower bounds for bandit problems, and [24] which develops an 
information theoretic approach suitable for stochastic optimization, a form of online learning, and gives 
bounds about the convergence rate at which iterative convex optimization schemes approach a solution. 

Following the standard approaches in our setting leads to major obstacles that we would like to briefly 
describe. Our hope is that this will help the reader to better appreciate our easy itinerary. As usual, 
we start by choosing a prior for x, which we take having zero mean. Coming from information theory, 
one would want to bound the mutual information between x (what we want to learn about) and y (the 
information we have), for any measurement scheme ai, . . . ,am- Assuming a deterministic measurement 
scheme, by the chain rule, we have 

m 

/(x,y) = h{y) -/i(y|x) = J^KVilyii-i]) - HVilyii-i],^)^ (1-7) 

1=1 

where y^j] := (yi, . . . , yj). Since the history up to time i — 1 determines a^, the conditional distribution of m 
given y[i^i] and x is then normal with mean (aj,x) and variance o"^. Hence, h{yi \ y[j_i],x) = ^ log(27refT^). 
This is the easy term to handle — the challenging term is h{yi \ y[i-i]) and it is not clear how one should go 
about finding a good upper bound. To see this, observe that 

Var(yj | y[i_i]) = Var((ai,x) | y[i-i]) + cr^. 

A standard approach to bound h{yi \ y[i_i]) is to write 

Hyi\yii-i]) < ^IElog(27reVar((ai,x) I y[i_i]) + 27recj^), 

using the fact that the Gaussian distribution maximizes the entropy among distributions with a given 
variance. If we simplify the problem by applying Jensen's inequality, we obtain 

/(x, y) < E 2 iog(^E(a- ^) V^' + 1) • (1-8) 

i=l 

The RHS needs to be bounded uniformly over all choices of measurement schemes, which is a daunting 
task given that aj is a function of y[i-i] which is in turn a function of x. We note however that the RHS 
can be bounded in the nonadaptive setting, which is the approach taken in [3] to establish (1.3). See 
also [25, 1, 30] for other asymptotic results in this direction. 

We have presented the problem in this form to help information theorists see the analogy with the 
problem of understanding the role of feedback in a Gaussian channel [9]. Specifically, we can view the 
inner products (aj,x) as inputs to a Gaussian channel where we observe the output of the channel via 
feedback. It is well-known that feedback does not substantially increase the capacity of a Gaussian channel, 
so one might expect this argument to be relevant to our problem as well. Crucially, however, in the case 
of a Gaussian channel the user has full control over the channel input — whereas in the absence of a priori 
knowledge of x, in our problem we are much more restricted in our control over the "channel input" (aj, x). 

1.4 Connections with other works 

A number of papers have studied the advantages (or sometimes the lack thereof) offered by adaptive 
sensing in the setting where one has noiseless data, see for example [11, 23, 16] and references therein. 
Of course, it is well known that one can uniquely determine a A;-sparse vector from 2k linear nonadaptive 
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noise-free measurements and, therefore, there is not much to dweU on. The aforementioned works of course 
do not study such a trivial problem. Rather, the point of view is that the signal is not exactly sparse, 
only approximately sparse, and the question is thus whether one can get a lower approximation error by 
employing an adaptive scheme. Whereas we study a statistical problem, this is a question in approximation 
theory. Consequently, the techniques and results of this line of research have no bearing on our problem. 

There is much research suggesting intelligent adaptive sensing strategies in the presence of noise and 
we mention a few of these works. In a setting closely related to ours — that of detecting the locations of 
the nonzeros of a sparse signal from noisy point samples (so that m > n) — [14] shows that by adaptively 
allocating sensing resources one can significantly improve upon the best nonadaptive schemes [12]. Lower 
bounds for nonadaptive and adaptive methods in this context were recently established in [20], with the 
adaptive lower bounds established through the sequential probability ratio test (SPRT) [27]. Closer to 
home, [15, 13] consider CS schemes (with m < n) which perform sequential subset selection via the 
random projections typical of CS, but which focus in on promising areas of the signal. When the signal is 
[i) very sparse (u) has sufficiently large entries and {in) has constant dynamic range, the method in [13] 
is able to remove a logarithmic factor from the MSE achieved by the Dantzig selector with (nonadaptive) 
i.i.d. Gaussian measurements. In a different direction, [7, 18] suggest Bayesian approaches where the 
measurement vectors are sequentially chosen so as to maximize the conditional differential entropy of yi 
given y[i-i]. Finally, another approach in [17] suggests a bisection method based on repeated measurements 
for the detection of 1-sparse vectors, subsequently extended to fc-sparse vectors via hashing. None of these 
works, however, establish a lower bound on the MSE of the recovered signal. 

1.5 Content 

We prove all of our results in Section 2, trying to give as much insight as possible as to why adaptive 
methods are not much more powerful than nonadaptive ones for detecting the support of a sparse signal. 
We will also attempt to describe the regime in which adaptivity might be helpful via simple numerical 
simulations in Section 3. These simulations show that adaptive algorithms are subject to a fundamental 
phase transition phenomenon. Finally, we comment on open problems and future research in Section 4. 

2 Limits of Adaptive Sensing Strategies 

This section establishes nonasymptotic lower bounds for the estimation of a sparse vector from adaptively 
selected noisy linear measurements. To begin with, we remind ourselves that we collect possibly adaptive 
measurements of the form (1.4) of an n-dimensional signal x where ||aj||2 < 1; from now on, we assume for 
simplicity and without loss of generality that a = 1. 

In our analysis below, we denote the total-variation metric between any two probability distributions P 
and Q by ||P — Q||tV; and their KL divergence by ir(P,Q) [22]. Our arguments will make use of Pinsker's 
inequality, which relates these two quantities via 



We shall also use the convexity of the KL divergence, which states that for Aj > and Aj = 1, we have 



in which {Pj} and {Qj} are families of probability distributions. 

Before proceeding, we argue that when we are given a prior '/r(x), we can restrict ourselves to determin- 
istic measurement schemes in the sense that ai is a deterministic vector and, for i > 2, a, is a deterministic 






(2.2) 
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function of y[i-i] = (yi, . . . , yi). In the general case we have = -Fi(y[i_i], Ui), where Fi is a deterministic 
function and Ui is random and independent of y[i-i] and Zi. With U = (Ui, . . . , Um), it fohows from the 
law of iterated expectation 

E[p-xf|U] 



E||X-x|P = E 



(the expectation in the left-hand side is taken over x, y and U) that there exists a fixed realization 
u = (ui, . . . ,Um) obeying 

E[||X-xf |U = u] <E||X-xf . 

Hence, we can construct an estimator based on a deterministic measurement scheme which is as good as 
any based on a randomized measurement scheme. Note that in a deterministic scheme, letting Px be the 
distribution of y[i-i] when the target vector is x and using the fact that yi is conditionally independent of 
y[i-i] given aj, we see that the likelihood factorizes as 



ciy[m]) = llP^{yMi), (2.3) 



1=1 



which will be of use in our analysis below. 



2.1 The Bernoulli prior 

We begin by studying the model in Theorem 1 which makes our argument most transparent. The proof of 
Theorem 2 essentially reduces to that of Theorem 1. 

In this model, we suppose that x S M" is sampled from a product prior: for each j € {1, . . . , n}, 

1/1 w.p. k/n, 

and the Xj's are independent. In this model, x has on average k nonzero entries, all with known positive 
amplitudes equal to /i. This model is easier to study than the related model in which one selects k 
coordinates uniformly at random and sets those to fj,. The reason is that in this Bernoulli model, the 
independence between the coordinates of x brings welcomed simplifications, as we shall see. 

Our goal here is to establish a lower bound on the MSE when x is drawn from this prior. We do this in 
two steps. First, we look at recovering the support of x, which is done via a reduction to multiple testing. 
Second, we show that a lower bound on the error for support recovery implies a lower bound on the MSE, 
leading to Theorem 1. 



2.1.1 Support recovery in Hamming distance 

We would like to understand how well we can estimate the support S = {j : xj 7^ 0} of x from the data 
(1.4), and shall measure performance by means of the expected Hamming distance. Here, the error of a 
procedure 5 for estimating the support S is defined as 

n 

E\SAS\ = J2nS,^S,) 
i=i 

where A denotes the symmetric difference, Sj = 1 if j E 5" and equals zero otherwise, and similarly for Sj. 
As we can see, this reduces our problem to a sequence of n independent hypothesis tests. We will obtain 
a lower bound on the number of errors among these tests by exploiting the following lemma. 
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Lemma 1. Consider the testing problem of deciding between Hq : x ~ Pq ^^^d Hi : x ~ Pi, where Hq and 
Hi occur with prior probabilities ttq and vri respectively. Under the 0-1 loss, The Bayes risk B obeys 

S>min(7ro,^i)(l-||Pi-Po||Tv). 

Proof. Assume without loss of generality that vri < vro. The test with minimum risk is the Bayes test 
rejecting Hq if and only if 



TTo ro^xj 

that is, if the adjusted likelihood ratio exceeds one; see [19, Pbm. 3.10]. A simple calculation shows that 
the Bayes risk obeys 

B = 7roEo (min(l,A)), 

where Eg denotes expectation under Pq. Using the fact that Eq A = vti/tto together with 

, 1 + A |A-1| 
min(l, A) = — h 



2 2 ' 

we obtain 

5 = ^-f Eo|A-l|. (2.5) 

Finally, 

vroEo |A - 1| = j IvridPi -vrodPo I ^ ^ 1^^^! -"^1^0 I + ttq - tti 

= 27ri||Pi-Po||TV + vro-vri, 

which when combined with (2.5) establishes the lemma. □ 

Theorem 3. Suppose that x is sampled according to the Bernoulli prior with k < n/2, then any estimate 
S obeys 

E|^A5|>A:(l-^y^). (2.6) 

Hence, if the amplitude of the signal is below y^n/m, we expect a large number of errors; indeed, if 
^ = ^n/m, then E I^A^I > k/2. 

Proof 2 Let vri = k/n and ttq = 1 - vri. For any j, set Pqj = F{-\xj = 0) and Pij = F{-\xj / 0). Let Bj 
denote the Bayes risk of the decision problem Hqj : xj = versus Hij : xj = 1. From Lemma 1 we have 
that 

n n n 

E \SAS\ = Y^nSj + Sj) > > ^1 5^(l - llPij - PojIItv 

j=i j=i j=i 

Applying the Cauchy-Schwartz inequality, we obtain 



E\SAS\>k{l ^ 

In 



EI|IPi.:'--IPo.^-|Itv)- (2-7) 



The theorem is a consequence of (2.7) combined with 



" „2 

^||Pi,j-Poj|||v < (2- 



■^The main ideas of our proof are similar to those in that of Assouad's Lemma, see [2, 29] for instance. Note, however, that 
our approach yields a sharper constant. 
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To establish (2.8), we apply Pinsker's inequality twice to obtain 

llPi,. -IPo,,|||v < y^(IPo,„Pi,,) + yi^(Pi,i,Po,,) (2.9) 

so that it remains to find an upper bound on the KL divergence between Pq.j and Pij. Write Pq = Po,j 
for short and likewise for Pij-. Then 

x' x' 

where x' = (xi, . . . , Xj+i, . . . , x„) and Po,x' is the conditional probability distribution of y^m] given x' 
and Xj = 0; Pi(y[m]) is defined similarly. The convexity of the KL divergence (2.2) gives 

i^(Po,Pi) < J]P(x')i^(Po,x',Pi,x')- (2.10) 

x' 

We now calculate this divergence. In order to do this, observe that we have yi = (a,, x) + Zi = Ci + Zi under 
Pq^x' while yi = aij^ + Ci + Zi under Pi^x'- This yields 

i^(P0,x',Pl,x') =Eo,x' log -^°'"' 



l,x' 



/I 1 \ 

i=i ^ ^ 

m 



i=l 
2 ™ 

= ^5^Eo,x'(a?,). 

i=l 

The first equality holds by definition, the second follows from (2.3), the third from yi = Ci + Zi under Po,x' 
and the last holds since Zi is independent of ajj and has zero mean. Using (2.10), we obtain 

2 ^ 

i^(Po,Pi)< '^5^E[a2jx,=0]. 

i=l 

Similarly, 

2 ™ 

i^(Pi,iPo)<^E^Kii^^ = ^] 

i=l 

and, therefore, (2.9) shows that 

2 m 2 

i=l i=l 

For any particular pair («, j) with i > 1, we can say very little about E[a?j] since it can depend on all the 
previous measurements in a potentially very complicated manner. However, by summing this inequality 
over j we can obtain (2.8) by using the only constraint we have imposed on the aj, namely, ||ai||2 = 1, so 
that V,-,- a?- = m. This establishes the theorem. □ 
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2.1.2 Estimation in mean-squared error 



It is now straightforward to obtain a lower bound on the MSE from Theorem 3. 
Proof of Theorem 1. Let S be the support of x and set S := {j : \xj\ > /i/2}. We have 
P - x||i = Y.{x, - x,f + Y,x]>^\S\S\ + ^\S\S\ = ^\S^S\ 

and, therefore, 



where the last inequality is from Theorem 3. We then plug in /i = | simplify to conclude. □ 

2.2 The conditional Bernoulli prior and minimax bound 

To establish Theorem 2, we choose as distribution on x the prior v^^k defined as follows: we start with the 
Bernoulli prior TTn,ak (2-4) with mean ak (instead of h) for some fixed a G (0, 1), and then condition that 
distribution to realizations with at most k nonzero entries. 



Proposition 1. Suppose that x is sampled according to v^^k with k < n/2, then any estimate S obeys 

2V nJ' 



E|5A5| >afc(l-7n,fc(a)-(^J-), (2.11) 



where 



a 

j=k+i 



1 " 

- V (2 + (j - l)/fc)P(Bin(n,a/c/n) 

rv ^ — ^ 



Proof. We begin by arguing that we can restrict attention to estimates S with cardinality at most 2k — 1. 
To see why, consider an arbitrary estimate S and set 



Sk 



5, |5|<2A;-1, 
0, |5| > 2k. 



Now if IS"! > 2k, then for any 5 with IS"] < k, we have 

\SAS\ >\S\S\> \S\ - |5| > A: > |0A5|. 

Since \S\ < k under it follows that Ej^ASI > E|S'fcAS'|, which proves the claim. From now on, we 
assume that |5| < 2k. 

Set '^n,ak{k) — ■P'x^Tr^ cjfc (1 5*1 ^ A:) and observe the identity 



\SAS\ 1 15| <k]= E._„,„, \\SAS\ l{\s\<k} 



To conclude, Theorem 3 together with |5A5| < |5| + jS'l < 2/;; - 1 + jS"] give 

^uMk) IEx~.„,, |SA5| = E,^^„,^, |5AS| - E,^^„,„, \S/\S\ l{\s\>k+i} 

>ak{l-^J^\- i2k-l+j)FiBm{n,ak/n)=j) 
^ j=k+i 



□ 
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We do as in Section 2.1.2 to conclude the proof of Theorem 2. Let 7 be a short for 7n,fc(o)- We find 
that the optimal choice is /x = |(1 — 7) y^n/m, yielding the lower bound 

Ah 

Ep-x||2 >a(l-7) — (2.12) 

To obtain a bound on 7, note that we can write 

ak-fn,k{'^) = 3kF{Bm{n,ak/n)>k + l)+ ^ F{Bm{n,ak/n) > j). 

j>k+2 

Bennett's inequality applied to the binomial distribution, gives 

P(Bin(m,p) > j) < exp [ — j log(j/ (mp)) + j — mp^ . (2-13) 

Therefore, if j > /c, we have 

P(Bin(n, afc/n) > j) < exp [ — j log{j / {ak)) + j — ak'j < exp[— P := a — 1 — loga, 

where the last inequality follows from the fact that the exponent is increasing in k over the range (0, j'/a). 
Note that /3 > for any a < 1. Applying this inequality, we get 

„-(fc+2)/3 

afc7n,fc(a) < 3A:e-('=+i)'^ + ^ (3^ + ^K^'^'^^ (2-14) 

when /3 > log 2. This bound yields 7 < 1 for all A; > 1 when a < 0.03, in which case (2.11) and 
(2.12) become meaningful. This bound is quite conservative, however. Using the definition of 7, we can 
numerically show that that by choosing a appropriately we can obtain a(l - 7) > le'^^^ - 1 > 0.21 for 
any n > 2 and all k < n/2. Thus we can always write a(l — 7)^ > g^- While setting = ^ ensures 
that Theorem 2 holds for any possible choice of /c, it is somewhat pessimistic in the sense that it is entirely 
dictated by the special case of A: = 1 (which could be handled more efficiently by alternative means [10]). 
For larger values of k, it is possible to obtain an improved constant. For example, when A: > 10 numerical 
calculations show that we can take = j^- Moreover, in view of the first inequality in (2.14), and the 
fact that /3 > for all q < 1, we have 7n,fc(a) — )• as A; — t- cxd and a is held fixed. Thus, for a sufficiently 
close to 1 we will have that 0(1 — 7)^ > ^ for k sufficiently large. We have also verified this numerically. 
Hence, the numerical constant i of Theorem 1 is also valid in Theorem 2 provided k is sufficiently large. 



3 Numerical Experiments 

In order to briefly illustrate the implications of the lower bounds in Section 2 and the potential limitations 
and benefits of adaptivity in general, we include a few simple numerical experiments. To simplify our 
discussion, we limit ourselves to existing adaptive procedures that aim at consistent support recovery: the 
adaptive procedure from [7] and the recursive bisection algorithm of [17]. 

We emphasize that in the case of a generic A;-sparse signal, there are many possibilities for adaptively 
estimating the support of the signal. For example, the approach in [13] iteratively rules out indices and 
could, in principle, proceed until only k candidate indices remain. In contrast, the approaches in [7] and [17] 
are built upon algorithms for estimating the support of 1-sparse signals. An algorithm for a 1-sparse signal 
could then be run k times to estimate a A:-sparse signal as in [7], or used in conjunction with a hashing 
scheme as in [17]. Since our goal is not to provide a thorough evaluation of the merits of all the different 
possibilities, but merely to illustrate the general limits of adaptivity, we simplify our discussion and focus 
exclusively on the simple case of one-sparse signals, i.e., where A; = 1. 
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Algorithm 1 Adaptive algorithm from [7] 



input: m X n random matrix B with i.i.d. Rademacher (±1 with equal probability) entries, 
initialize: p = ^(1, . . . , 1)^. 
for i = 1 to i = m do 

Compute ai = . . . , bi^n^/PnV ■ 

Observe yi = {ai, x) + Zj. 

Update posterior distribution p of x given (ai, yi), . . . , (aj, yj) using the rule in [7]. 
end for 

output: Estimate for support(x) is the index where p attains its maximum value. 



Algorithm 2 Recursive bisection algorithm of [17] 
input: mi, ... , rris^,^. 

initialize: jf ^ = {1, . . . , f }, = {f + 1, . . . , n}. 

for S = 1 to S = Smax do 

Construct the rris x n matrix A^'^^ with rows |J;[*''|~2l (s) — |J2*'*|~2l 
Observe y^^^ = A^^^x + z^*). 
Compute u-W = ES^f^- 

Subdivide: Update and ^2^^^'' by partitioning if w^''^^ > or Jg*^ if w^''^^ < 0. 

end for 

output: Estimate for support(x) is if u,(^n.ax) > g, J^'-^^'") if ?x;(«max) < g. 



Specifically, in our experiments we will consider the uniform prior on the set of vectors with a single 
nonzero entry equal to /x > as in Section 2. Since we are focusing only on the case of A; = 1, the algorithms 
in [7] and [17] are extremely simple and are shown in Algorithm 1 and Algorithm 2 respectively. Note 
that in Algorithm 1 the step of updating the posterior distribution p consists of an iterative update rule 
given in [7] and does not require any a priori knowledge of the signal x or In Algorithm 2, we simplify 
the recursive bisection algorithm of [17] using the knowledge that fJ, > 0, which allows us to eliminate the 
second stage of the algorithm aimed at detecting negative coefficients. Note that this algorithm proceeds 
through Smax = log2 n stages and we must allocate a certain number of measurements to each stage. In 
our experiments we set nis = [/32~*], where (3 is selected to ensure that Yl^s=i"^ < 'm- 

3.1 Evolution of the posterior 

We begin by showing the results of a simple simulation that illustrates the behavior of the posterior 
distribution of x as a function of fx for both adaptive schemes. Specifically, we assume that m is fixed and 
collect m measurements using each approach. Given the measurements y, we then compute the posterior 
distribution p using the true prior used to generate the signal, which can be computed using the fact that 

Pj (xexp(^-^\\y - fiAejWi^ , (3.1) 

where cr^ is the noise variance and ej denotes the jth element of the standard basis. What we expect is that 
once /i exceeds a certain threshold (which depends on m), the posterior will become highly concentrated 
on the true support of x. To quantify this, we consider the case where j* denotes the true location of the 
nonzero element of x and define 

Note that when A < 1, we cannot reliably detect the nonzero, but when A 1 we can. 
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(a) (b) (c) 

Figure 1: Behavior of the posterior distribution as a function of fi for several values of m. (a) shows the results for 
nonadaptive measurements, (b) shows the results for Algorithm 1. (c) shows the results for Algorithm 2. We see 
that Algorithm 2 is able to detect somewhat weaker signals than Algorithm 1. However, for both cases we observe 
that once ^ exceeds a certain threshold proportional to yjn/m, the ratio A of pj* to the second largest posterior 
probability grows exponentially fast, but that this does not differ substantially from the behavior observed in (a) 
when using nonadaptive measurements. 

In Figure 1 we show the results for a few representative values of m (a) when using nonadaptive 
measurements, i.e., a (normalized) i.i.d. Rademacher random matrix A, compared to the results of (b) 
Algorithm 1, and (c) Algorithm 2. For each value of m and for each value of /x, we acquire m measurements 
using each approach and compute the posterior p according to (3.1). We then compute the value of A. We 
repeat this for 10,000 iterations and plot the median value of A for each value of ^ for all three approaches. 
In our experiments we set n = 512 and = 1. We truncate the vertical axis at 10^ to ensure that all 
curves are comparable. We observe that in each case, once /i exceeds a certain threshold proportional to 
■s/rilm^ the ratio A of pj* to the second largest posterior probability grows exponentially fast. As expected, 
this occurs for both the nonadaptive and adaptive strategies, with no substantial difference in terms of 
how large /x must be before support recovery is assured (although Algorithm 2 seems to improve upon the 
nonadaptive strategy by a small constant). 

3.2 MSE performance 

We have just observed that for a given number of measurements m, there is a critical value of \i below 
which we cannot reliably detect the support. In this section we examine the impact of this phenomenon 
on the resulting MSE of a two-stage procedure that first uses = pm adaptive measurements to detect 
the location of the nonzero with either Algorithm 1 or Algorithm 2 and then reserves nie = {1 — p)m 
measurements to directly estimate the value of the identified coefficient. It is not hard to show that if we 
correctly identify the location of the nonzero, then this will result in an MSE of (men)~^ = ((1 — p)mn)~^ . 
As a point of comparison, if an oracle provided us with the location of the nonzero a priori, we could 
devote all m measurements to estimating its value, with the best possible MSE being Thus, if we can 
correctly detect the nonzero, this procedure will perform within a constant factor of the oracle. 

We illustrate the performance of Algorithm 1 and Algorithm 2 in terms of the resulting MSE as a 
function of the amplitude /x of the nonzero in Figure 2. In this experiment we set n = 512 and m = 128 
with p = 2 so that rrid = 64 and nie = 64. We then compute the average MSE over 100,000 iterations 
for each value of and for both algorithms. We compare this to a nonadaptive procedure which uses a 
(normalized) i.i.d. Rademacher matrix followed by orthogonal matching pursuit (OMP). Note that in the 
worst case the MSE of the adaptive algorithms is comparable to the MSE obtained by the nonadaptive 
algorithm and exceeds the lower bound in Theorem 2 by only a small constant factor. However, when ^ 
begins to exceed a critical threshold, the MSE rapidly decays and approaches the optimal value of 
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Figure 2: The performance of Algorithm 1 and Algorithm 2 in the context of a two-stage procedure that hrst uses 
md = Y adaptive measurements to detect the location of the nonzero and then uses m^, = y measurements to 
directly estimate the value of the identified coefEcient. We show the resulting MSE as a function of the amplitude 
fi of the nonzero entry, and compare this to a nonadaptive procedure which uses a (normalized) i.i.d. Rademacher 
matrix followed by OMP. In the worst case, the MSE of the adaptive algorithms is comparable to the MSE obtained 
by the nonadaptive algorithm and exceeds the lower bound in Theorem 2 by only a small constant factor. When 
fi begins to exceed this critical threshold, the MSE of the adaptive algorithms rapidly decays below that of the 
nonadaptive algorithm and approaches which is the MSE one would obtain given measurements and a 

priori knowledge of the support. 

Note that when /i is large we can take me — )• m and hence can actually get arbitrarily close to ^ in the 
asymptotic regime. 



4 Discussion 



The contribution of this paper is to show that if one has the freedom to choose any adaptive sensing 
strategy and any estimation procedure no matter how complicated or computationally intractable, we 
would not be able to universally improve over a simple nonadaptive strategy that simply projects the 
signal onto a lower dimensional space and perform recovery via ii minimization. This "negative" result 
should not conceal the fact that adaptivity may help tremendously if the SNR is sufficiently large, as 
illustrated in Section 3. Hence, we regard the design and analysis of effective adaptive schemes as a subject 
of important future research. At the methodological level, it seems important to develop adaptive strategies 
and algorithms for support estimation that are as accurate and as robust as possible. Further, a transition 
towards practical applications would need to involve engineering hardware that can effectively implement 
this sort of feedback, an issue which poses all kinds of very concrete challenges. Finally, at the theoretical 
level, it would be of interest to analyze the phase transition phenomenon we expect to occur in simple 
Bayesian signal models. For instance, a central question would be how many measurements are required 
to transition from a nearly flat posterior to one mostly concentrated on the true support. 

In closing, we note that after the submission of this paper, a variant of Algorithm 2 was shown to 
recover the correct support of a 1-sparse vector with high probability provided that the amplitude fi of the 
nonzero entry obeys fi > C\J^rifrn for some positive numerical constant C [10, 21]. This implies that for 
A; = 1, the lower bound in Theorem 2 is tight up to constant factors. Thus, adaptive methods have the 
potential to remove the log(n/A;) factor required in the nonadaptive setting. 



14 



Acknowledgements 



The authors would hke to thank the reviewers as weU as Rui Castro, Jarvis Haupt, and Alexander Tsybakov for their 
insightful feedback. They are grateful to Xiaodong Li for suggesting an improvement in the proof of Theorem 3 and 
to Adam Bull for pointing out a technical error. E. A-C. is partially supported by ONR grant N00014-09-1-0258. 
E. C. is partially supported by NSF via grant CCF-0963835 and the 2006 Waterman Award, by AFOSR under grant 
FA9550-09-1-0643 and by ONR under grant N00014-09-1-0258. M. D. is supported by NSF grant DMS-1004718. 

References 

[1] S. Aeron, V. Saligrama, and M. Zhao. Information theoretic bounds for compressed sensing. IEEE Trans. 
Inform. Theory, 56(10):5111-5130, 2010. 

P. Assouad. Deux remarques sur I'estimation. C. R. Acad. Sci. Paris Ser. I Math., 296(23):1021-1024, 1983. 

E. Candes and M. Davenport. How well can we estimate a sparse vector? to appear in Appl. Comput. Harmon. 
Anal, 2012. 

E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly 
incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489-509, 2006. 

E. Candes and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? 
IEEE Trans. Inform. Theory, 52(12):5406-5425, 2006. 

E. Candes and T. Tao. The Dantzig Selector: Statistical estimation when p is much larger than n. Ann. Stat., 
35(6):2313-2351, 2007. 

R. Castro, J. Haupt, R. Nowak, and G. Raz. Finding needles in noisy haystacks. In Proc. IEEE Int. Conf. 
Acoust., Speech, and Signal Processing (ICASSP), Las Vegas, NV, Apr. 2008. 

R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Trans. Inform. Theory, 54(5):2339-2353, 
2008. 

T. Cover and J. Thomas. Elements of information theory. Wiley-Intcrscience, Hoboken, NJ, 2006. 

M. A. Davenport and E. Arias-Castro. Compressive binary search. In Proc. IEEE Int. Symp. Inform. Theory 
(ISIT), Cambridge, MA, Jul. 2012. 

D. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4): 1289-1306, 2006. 

D. Donoho and J. Jin. Higher criticism for detecting sparse heterogeneous mixtures. Ann. Stat., 32(3):962-994, 
2004. 

J. Haupt, R. Baraniuk, R. Castro, and R. Nowak. Compressive distilled sensing: Sparse recovery using adaptivity 
in compressive measurements. In Proc. Asilomar Conf. Signals, Systems, and Computers, Pacific Grove, CA, 
Nov. 2009. 

J. Haupt, R. Castro, and R. Nowak. Distilled sensing: Selective sampling for sparse signal recovery. In Proc. 
Int. Conf Art. Intell. Stat. (AISTATS), Clearwater Beach, FL, Apr. 2009. 

J. Haupt, R. Nowak, and R. Castro. Adaptive sensing for sparse signal recovery. In Proc. Digital Signal 
Processing Workshop, Marco Island, FL, Jan. 2009. 

P. Indyk, E. Price, and D. Woodruff. On the power of adaptivity in sparse recovery. In Proc. IEEE Symp. 
Found. Comp. Science (FOCS), Palm Springs, CA, Oct. 2011. 

M. Iwen. Group testing strategies for recovery of sparse signals in noise. In Proc. Asilomar Conf. Signals, 
Systems, and Computers, Pacific Grove, CA, Nov. 2009. 

S. Ji, Y. Xue, and L. Carin. Bayesian compressive sensing. IEEE Trans. Signal Processing, 56(6):2346-2356, 
2008. 

E. Lehmann and J. Romano. Testing statistical hypotheses. Springer Texts in Statistics. Springer, New York, 
2005. 



15 



[20] M. Malloy and R. Nowak. On the limits of sequential testing in high dimensions. In Proc. Asilomar Conf. 
Signals, Systems, and Computers, Nov., 2011. Pacific Grove, CA. 

[21] M. Malloy and R. Nowak. Near-optimal compressive binary search. Arxiv preprint arXiv:1203.1804, 2012. 

[22] P. Massart. Concentration inequalities and model selection, volume 1896 of Lecture Notes in Mathematics. 
Springer, Berlin, 2007. 

[23] E. Novak. On the power of adaptation. J. Complexity, 12(3):199-237, 1996. 

[24] M. Raginsky and A. Rakhlin. Information complexity of black-box convex optimization: A new look via feedback 
information theory. In Proc. Allerton Conf. Communication, Control, and Computing, Monticello, IL, Oct. 2009. 

[25] G. Raskutti, M. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression 
over £g-balls. IEEE Trans. Inform. Theory, 57(10):6976-6994, 2011. 

[26] P. RigoUct and A. Zeevi. Nonparametric bandits with covariates. In Proc. Int. Conf. Learning Thoery ( COLT), 
Haifa, Israel, Jun. 2010. 

[27] D. Siegmund. Sequential analysis. Springer Series in Statistics. Springer- Verlag, New York, 1985. Tests and 
confidence intervals. 

[28] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(l):267-288, 
1996. 

[29] A. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. 

[30] N. Verzelen. Minimax risks for sparse regressions: Ultra- high-dimensional phenomenons. Electron. J. Statist., 
6:38-90, 2012. 

[31] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423-435. Springer, New York, 
1997. 



16 



